Provide an overview of the project goals and motivation
The motivation for a project exploring the correlation between low birth weight, premature birth rate, cancer mortality rate, cardiovascular disease hospitalization rate, asthma hospitalization rate and PM2.5 concentration, as well as socioeconomic, racial, education, and gender factors across the 62 counties of New York State, stems from a compelling public health imperative.
All the outcomes of interest are critical indicators of population health. By examining the impact of fine particulate pollution, this project aims to uncover environmental health disparities that disproportionately affect under-served communities.
It seeks to inform public health policy by highlighting the need for targeted interventions that could mitigate the adverse effects of air pollution on vulnerable populations, including pregnant women and newborns. The analysis of socioeconomic and racial factors will offer insights into the complex interplay between environment and social determinants of health, potentially guiding urban planning and healthcare resource allocation.
Furthermore, by considering gender-specific vulnerabilities,the project aims to provide a comprehensive overview of the risk landscape, fostering community awareness and prompting action to improve air quality and health equity in one of the most densely populated urban areas in the world.
We aim to assess the impact of PM2.5 concentration on various health outcomes across New York state, encompassing neonatal health to the well-being of mature individuals. Our investigation involves not only preliminary data analysis but also the implementation of regression models, accounting for demographic characteristics. Through this comprehensive approach, we seek to understand the individual contributions of different variables to health outcomes. Ultimately, our goal is to identify the most effective model for predicting these health outcomes, with the potential to provide valuable insights for public health benefits.
Source, scraping method, cleaning, etc.
Click show on the right to view the code chunk for data
importing.
lowbirthweight <- read_csv("csv_NYC_lowbirthweight.csv")
pm2_5 <- read_csv("pm2.5.csv")
edu_NY <- read_excel("Edu_NY.xlsx")
race_NY <- read_excel("Race_NY.xlsx")
HHincome_NY <- read_excel("HHincome_NY.xlsx")
Age_NY <- read_excel("Age_NY.xlsx")
Sex_NY <- read_excel("Sex_NY.xlsx")
health <- read_excel("chir_current_data.xlsx")
uscounties <- read_csv("uscounties.csv") #Simplemaps.com
9 data files are obtained from publicly available sources online. These will be imported, wrangled properly, and merged to create a complete data for further analysis.
Click show on the right to view the code chunk for data
importing.
pm2_5 <- pm2_5 %>%
janitor::clean_names()%>%
select (county,value) %>%
rename (annual_pm2.5 = "value")
pm2_5 %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | annual_pm2.5 |
|---|---|
| Albany | 7.9 |
| Allegany | 6.5 |
| Bronx | 8.3 |
| Broome | 6.9 |
| Cattaraugus | 7.3 |
| Cayuga | 6.4 |
| Chautauqua | 6.6 |
| Chemung | 6.2 |
| Chenango | 6.5 |
| Clinton | 7.0 |
| Columbia | 7.0 |
| Cortland | 6.1 |
| Delaware | 6.6 |
| Dutchess | 7.7 |
| Erie | 7.6 |
| Essex | 3.4 |
| Franklin | 7.4 |
| Fulton | 6.7 |
| Genesee | 7.3 |
| Greene | 6.8 |
| Hamilton | 6.0 |
| Herkimer | 6.5 |
| Jefferson | 6.8 |
| Kings | 7.9 |
| Lewis | 6.1 |
| Livingston | 6.7 |
| Madison | 6.1 |
| Monroe | 7.4 |
| Montgomery | 6.9 |
| Nassau | 8.7 |
| New York | 10.4 |
| Niagara | 8.7 |
| Oneida | 6.5 |
| Onondaga | 5.1 |
| Ontario | 6.6 |
| Orange | 6.4 |
| Orleans | 7.3 |
| Oswego | 6.4 |
| Otsego | 6.4 |
| Putnam | 7.9 |
| Queens | 8.3 |
| Rensselaer | 7.5 |
| Richmond | 7.8 |
| Rockland | 8.8 |
| Saratoga | 7.7 |
| Schenectady | 7.5 |
| Schoharie | 6.5 |
| Schuyler | 6.0 |
| Seneca | 6.4 |
| St. Lawrence | 7.7 |
| Steuben | 4.8 |
| Suffolk | 6.8 |
| Sullivan | 7.2 |
| Tioga | 6.6 |
| Tompkins | 6.2 |
| Ulster | 7.4 |
| Warren | 7.2 |
| Washington | 7.3 |
| Wayne | 6.7 |
| Westchester | 9.3 |
| Wyoming | 7.0 |
| Yates | 6.0 |
The dataset is obtained from EPH Tracking Website from CDC (https://ephtracking.cdc.gov/DataExplorer/). This has 62
rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- annual_pm2.5: annual estimated PM2.5 concentration at
each county (ug/m^3)
The following set of demographic datasets is obtained from Census Reporter webpage (https://censusreporter.org/).
edu <- edu_NY %>%
janitor::clean_names()%>%
mutate (percentage_high_education = (bach_male+master_male+prof_male+doct_male+bach_female+master_female+prof_female+doct_female)/total) %>%
filter(str_detect (name, " County, NY")) %>%
mutate(county = str_replace (name, " County, NY", "")) %>%
select(county,percentage_high_education) %>%
mutate(county = str_replace (county, "St.", "St")) %>%
mutate(county = str_replace (county, "Stuben", "Steuben"))%>%
select(county,percentage_high_education)
edu %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | percentage_high_education |
|---|---|
| Albany | 0.443 |
| Allegany | 0.224 |
| Bronx | 0.209 |
| Broome | 0.289 |
| Cattaraugus | 0.200 |
| Cayuga | 0.222 |
| Chautauqua | 0.233 |
| Chemung | 0.239 |
| Chenango | 0.183 |
| Clinton | 0.254 |
| Columbia | 0.349 |
| Cortland | 0.281 |
| Delaware | 0.230 |
| Dutchess | 0.377 |
| Erie | 0.359 |
| Essex | 0.305 |
| Franklin | 0.216 |
| Fulton | 0.182 |
| Genesee | 0.220 |
| Greene | 0.256 |
| Hamilton | 0.205 |
| Herkimer | 0.214 |
| Jefferson | 0.243 |
| Kings | 0.395 |
| Lewis | 0.188 |
| Livingston | 0.289 |
| Madison | 0.262 |
| Monroe | 0.400 |
| Montgomery | 0.192 |
| Nassau | 0.475 |
| New York | 0.626 |
| Niagara | 0.263 |
| Oneida | 0.262 |
| Onondaga | 0.365 |
| Ontario | 0.375 |
| Orange | 0.314 |
| Orleans | 0.172 |
| Oswego | 0.208 |
| Otsego | 0.333 |
| Putnam | 0.415 |
| Queens | 0.339 |
| Rensselaer | 0.343 |
| Richmond | 0.349 |
| Rockland | 0.420 |
| St Lawrence | 0.236 |
| Saratoga | 0.427 |
| Schenectady | 0.333 |
| Schoharie | 0.232 |
| Schuyler | 0.243 |
| Seneca | 0.215 |
| Steuben | 0.255 |
| Suffolk | 0.383 |
| Sullivan | 0.270 |
| Tioga | 0.258 |
| Tompkins | 0.553 |
| Ulster | 0.345 |
| Warren | 0.331 |
| Washington | 0.206 |
| Wayne | 0.244 |
| Westchester | 0.506 |
| Wyoming | 0.181 |
| Yates | 0.245 |
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percentage_high_education: percentage of population who
finished a higher education level (higher than a bachelor degree)
race_non_hisp_white <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "White Non-Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_non_hisp_white") %>%
select (county,percent_non_hisp_white) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_non_hisp_white ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_non_hisp_black <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "Black Non-Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_non_hisp_black") %>%
select (county,percent_non_hisp_black) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_non_hisp_black ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_hisp_white <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "White Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_hisp_white") %>%
select (county,percent_hisp_white) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_hisp_white ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_hisp_black <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "Black Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_hisp_black") %>%
select (county,percent_hisp_black) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_hisp_black ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_merge <- race_non_hisp_white %>%
inner_join(race_non_hisp_black, by = "county") %>%
inner_join(race_hisp_white, by = "county") %>%
inner_join(race_hisp_black, by = "county")
race_merge <- race_merge %>%
mutate (percent_other = 1 - percent_non_hisp_white - percent_non_hisp_black - percent_hisp_white - percent_hisp_black)
race_merge %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | percent_non_hisp_white | percent_non_hisp_black | percent_hisp_white | percent_hisp_black | percent_other |
|---|---|---|---|---|---|
| Albany | 0.710 | 0.114 | 0.027 | 0.007 | 0.142 |
| Allegany | 0.941 | 0.018 | 0.009 | 0.001 | 0.031 |
| Bronx | 0.090 | 0.285 | 0.109 | 0.060 | 0.456 |
| Broome | 0.824 | 0.050 | 0.018 | 0.003 | 0.105 |
| Cattaraugus | 0.897 | 0.013 | 0.013 | 0.001 | 0.076 |
| Cayuga | 0.888 | 0.039 | 0.013 | 0.001 | 0.059 |
| Chautauqua | 0.865 | 0.023 | 0.033 | 0.005 | 0.074 |
| Chemung | 0.853 | 0.058 | 0.016 | 0.002 | 0.072 |
| Chenango | 0.941 | 0.006 | 0.014 | 0.001 | 0.038 |
| Clinton | 0.897 | 0.039 | 0.007 | 0.001 | 0.056 |
| Columbia | 0.848 | 0.039 | 0.022 | 0.002 | 0.089 |
| Cortland | 0.914 | 0.019 | 0.016 | 0.000 | 0.051 |
| Delaware | 0.912 | 0.016 | 0.022 | 0.002 | 0.048 |
| Dutchess | 0.699 | 0.099 | 0.053 | 0.007 | 0.141 |
| Erie | 0.746 | 0.124 | 0.020 | 0.004 | 0.107 |
| Essex | 0.912 | 0.029 | 0.010 | 0.003 | 0.046 |
| Franklin | 0.811 | 0.051 | 0.008 | 0.001 | 0.129 |
| Fulton | 0.906 | 0.014 | 0.016 | 0.001 | 0.063 |
| Genesee | 0.900 | 0.020 | 0.012 | 0.001 | 0.066 |
| Greene | 0.838 | 0.053 | 0.035 | 0.004 | 0.070 |
| Hamilton | 0.921 | 0.010 | 0.011 | 0.000 | 0.058 |
| Herkimer | 0.932 | 0.012 | 0.012 | 0.000 | 0.045 |
| Jefferson | 0.809 | 0.055 | 0.039 | 0.004 | 0.093 |
| Kings | 0.367 | 0.286 | 0.054 | 0.018 | 0.274 |
| Lewis | 0.946 | 0.009 | 0.005 | 0.007 | 0.032 |
| Livingston | 0.895 | 0.026 | 0.017 | 0.002 | 0.060 |
| Madison | 0.917 | 0.015 | 0.011 | 0.000 | 0.056 |
| Monroe | 0.698 | 0.140 | 0.041 | 0.009 | 0.113 |
| Montgomery | 0.794 | 0.026 | 0.056 | 0.004 | 0.120 |
| Nassau | 0.583 | 0.110 | 0.063 | 0.006 | 0.239 |
| New York | 0.467 | 0.121 | 0.068 | 0.019 | 0.324 |
| Niagara | 0.845 | 0.066 | 0.017 | 0.002 | 0.071 |
| Oneida | 0.811 | 0.057 | 0.027 | 0.003 | 0.101 |
| Onondaga | 0.760 | 0.103 | 0.020 | 0.006 | 0.110 |
| Ontario | 0.891 | 0.020 | 0.022 | 0.003 | 0.064 |
| Orange | 0.622 | 0.100 | 0.085 | 0.010 | 0.183 |
| Orleans | 0.857 | 0.056 | 0.021 | 0.001 | 0.065 |
| Oswego | 0.928 | 0.010 | 0.012 | 0.001 | 0.050 |
| Otsego | 0.904 | 0.018 | 0.018 | 0.002 | 0.058 |
| Putnam | 0.764 | 0.029 | 0.068 | 0.006 | 0.133 |
| Queens | 0.246 | 0.168 | 0.090 | 0.010 | 0.486 |
| Rensselaer | 0.820 | 0.061 | 0.020 | 0.004 | 0.094 |
| Richmond | 0.596 | 0.091 | 0.097 | 0.010 | 0.206 |
| Rockland | 0.629 | 0.108 | 0.065 | 0.007 | 0.191 |
| St Lawrence | 0.917 | 0.025 | 0.006 | 0.001 | 0.050 |
| Saratoga | 0.894 | 0.015 | 0.019 | 0.002 | 0.069 |
| Schenectady | 0.716 | 0.094 | 0.024 | 0.005 | 0.162 |
| Schoharie | 0.910 | 0.013 | 0.015 | 0.003 | 0.059 |
| Schuyler | 0.944 | 0.009 | 0.009 | 0.001 | 0.038 |
| Seneca | 0.879 | 0.050 | 0.018 | 0.002 | 0.051 |
| Steuben | 0.929 | 0.016 | 0.009 | 0.001 | 0.046 |
| Suffolk | 0.664 | 0.069 | 0.098 | 0.006 | 0.162 |
| Sullivan | 0.702 | 0.080 | 0.066 | 0.005 | 0.147 |
| Tioga | 0.943 | 0.009 | 0.010 | 0.000 | 0.037 |
| Tompkins | 0.769 | 0.035 | 0.023 | 0.005 | 0.168 |
| Ulster | 0.760 | 0.056 | 0.040 | 0.004 | 0.139 |
| Warren | 0.930 | 0.012 | 0.014 | 0.002 | 0.043 |
| Washington | 0.911 | 0.027 | 0.011 | 0.003 | 0.048 |
| Wayne | 0.889 | 0.021 | 0.019 | 0.001 | 0.071 |
| Westchester | 0.524 | 0.133 | 0.075 | 0.013 | 0.255 |
| Wyoming | 0.888 | 0.040 | 0.015 | 0.002 | 0.056 |
| Yates | 0.948 | 0.008 | 0.015 | 0.000 | 0.029 |
This has 62 rows and 6 columns of data. In which, 6 variables
are:
- county: NY county name
- percent_non_hisp_white: percentage of population who are
identified as Non-Hispanic White
- percent_non_hisp_black: percentage of population who are
identified as Non-Hispanic Black
- percent_hisp_white: percentage of population who are
identified as Hispanic White
- percent_hisp_black: percentage of population who are
identified as Hispanic Black
- percent_other: percentage of population who are
identified as any other ethinic groups
income <- HHincome_NY %>%
janitor::clean_names() %>%
filter (x1 == "percent_high_income") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_high_income") %>%
select (county,percent_high_income) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_high_income ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_merge %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | percent_non_hisp_white | percent_non_hisp_black | percent_hisp_white | percent_hisp_black | percent_other |
|---|---|---|---|---|---|
| Albany | 0.710 | 0.114 | 0.027 | 0.007 | 0.142 |
| Allegany | 0.941 | 0.018 | 0.009 | 0.001 | 0.031 |
| Bronx | 0.090 | 0.285 | 0.109 | 0.060 | 0.456 |
| Broome | 0.824 | 0.050 | 0.018 | 0.003 | 0.105 |
| Cattaraugus | 0.897 | 0.013 | 0.013 | 0.001 | 0.076 |
| Cayuga | 0.888 | 0.039 | 0.013 | 0.001 | 0.059 |
| Chautauqua | 0.865 | 0.023 | 0.033 | 0.005 | 0.074 |
| Chemung | 0.853 | 0.058 | 0.016 | 0.002 | 0.072 |
| Chenango | 0.941 | 0.006 | 0.014 | 0.001 | 0.038 |
| Clinton | 0.897 | 0.039 | 0.007 | 0.001 | 0.056 |
| Columbia | 0.848 | 0.039 | 0.022 | 0.002 | 0.089 |
| Cortland | 0.914 | 0.019 | 0.016 | 0.000 | 0.051 |
| Delaware | 0.912 | 0.016 | 0.022 | 0.002 | 0.048 |
| Dutchess | 0.699 | 0.099 | 0.053 | 0.007 | 0.141 |
| Erie | 0.746 | 0.124 | 0.020 | 0.004 | 0.107 |
| Essex | 0.912 | 0.029 | 0.010 | 0.003 | 0.046 |
| Franklin | 0.811 | 0.051 | 0.008 | 0.001 | 0.129 |
| Fulton | 0.906 | 0.014 | 0.016 | 0.001 | 0.063 |
| Genesee | 0.900 | 0.020 | 0.012 | 0.001 | 0.066 |
| Greene | 0.838 | 0.053 | 0.035 | 0.004 | 0.070 |
| Hamilton | 0.921 | 0.010 | 0.011 | 0.000 | 0.058 |
| Herkimer | 0.932 | 0.012 | 0.012 | 0.000 | 0.045 |
| Jefferson | 0.809 | 0.055 | 0.039 | 0.004 | 0.093 |
| Kings | 0.367 | 0.286 | 0.054 | 0.018 | 0.274 |
| Lewis | 0.946 | 0.009 | 0.005 | 0.007 | 0.032 |
| Livingston | 0.895 | 0.026 | 0.017 | 0.002 | 0.060 |
| Madison | 0.917 | 0.015 | 0.011 | 0.000 | 0.056 |
| Monroe | 0.698 | 0.140 | 0.041 | 0.009 | 0.113 |
| Montgomery | 0.794 | 0.026 | 0.056 | 0.004 | 0.120 |
| Nassau | 0.583 | 0.110 | 0.063 | 0.006 | 0.239 |
| New York | 0.467 | 0.121 | 0.068 | 0.019 | 0.324 |
| Niagara | 0.845 | 0.066 | 0.017 | 0.002 | 0.071 |
| Oneida | 0.811 | 0.057 | 0.027 | 0.003 | 0.101 |
| Onondaga | 0.760 | 0.103 | 0.020 | 0.006 | 0.110 |
| Ontario | 0.891 | 0.020 | 0.022 | 0.003 | 0.064 |
| Orange | 0.622 | 0.100 | 0.085 | 0.010 | 0.183 |
| Orleans | 0.857 | 0.056 | 0.021 | 0.001 | 0.065 |
| Oswego | 0.928 | 0.010 | 0.012 | 0.001 | 0.050 |
| Otsego | 0.904 | 0.018 | 0.018 | 0.002 | 0.058 |
| Putnam | 0.764 | 0.029 | 0.068 | 0.006 | 0.133 |
| Queens | 0.246 | 0.168 | 0.090 | 0.010 | 0.486 |
| Rensselaer | 0.820 | 0.061 | 0.020 | 0.004 | 0.094 |
| Richmond | 0.596 | 0.091 | 0.097 | 0.010 | 0.206 |
| Rockland | 0.629 | 0.108 | 0.065 | 0.007 | 0.191 |
| St Lawrence | 0.917 | 0.025 | 0.006 | 0.001 | 0.050 |
| Saratoga | 0.894 | 0.015 | 0.019 | 0.002 | 0.069 |
| Schenectady | 0.716 | 0.094 | 0.024 | 0.005 | 0.162 |
| Schoharie | 0.910 | 0.013 | 0.015 | 0.003 | 0.059 |
| Schuyler | 0.944 | 0.009 | 0.009 | 0.001 | 0.038 |
| Seneca | 0.879 | 0.050 | 0.018 | 0.002 | 0.051 |
| Steuben | 0.929 | 0.016 | 0.009 | 0.001 | 0.046 |
| Suffolk | 0.664 | 0.069 | 0.098 | 0.006 | 0.162 |
| Sullivan | 0.702 | 0.080 | 0.066 | 0.005 | 0.147 |
| Tioga | 0.943 | 0.009 | 0.010 | 0.000 | 0.037 |
| Tompkins | 0.769 | 0.035 | 0.023 | 0.005 | 0.168 |
| Ulster | 0.760 | 0.056 | 0.040 | 0.004 | 0.139 |
| Warren | 0.930 | 0.012 | 0.014 | 0.002 | 0.043 |
| Washington | 0.911 | 0.027 | 0.011 | 0.003 | 0.048 |
| Wayne | 0.889 | 0.021 | 0.019 | 0.001 | 0.071 |
| Westchester | 0.524 | 0.133 | 0.075 | 0.013 | 0.255 |
| Wyoming | 0.888 | 0.040 | 0.015 | 0.002 | 0.056 |
| Yates | 0.948 | 0.008 | 0.015 | 0.000 | 0.029 |
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_high_income: percentage of population who are at
higher income households (>$75,000 annually)
age <- Age_NY %>%
janitor::clean_names() %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "median_age") %>%
select (county,median_age) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,median_age ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
age %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | median_age |
|---|---|
| Albany | 38.0 |
| Allegany | 39.2 |
| Bronx | 34.8 |
| Broome | 39.7 |
| Cattaraugus | 42.3 |
| Cayuga | 43.3 |
| Chautauqua | 42.8 |
| Chemung | 41.4 |
| Chenango | 44.7 |
| Clinton | 40.7 |
| Columbia | 48.9 |
| Cortland | 36.5 |
| Delaware | 48.5 |
| Dutchess | 42.5 |
| Erie | 40.2 |
| Essex | 48.3 |
| Franklin | 40.6 |
| Fulton | 44.3 |
| Genesee | 42.6 |
| Greene | 46.6 |
| Hamilton | 56.4 |
| Herkimer | 44.3 |
| Jefferson | 32.9 |
| Kings | 35.7 |
| Lewis | 42.0 |
| Livingston | 41.7 |
| Madison | 42.5 |
| Monroe | 39.2 |
| Montgomery | 40.5 |
| Nassau | 41.8 |
| New York | 38.1 |
| Niagara | 43.3 |
| Oneida | 40.8 |
| Onondaga | 39.3 |
| Ontario | 43.6 |
| Orange | 37.0 |
| Orleans | 43.2 |
| Oswego | 40.7 |
| Otsego | 42.2 |
| Putnam | 44.2 |
| Queens | 39.6 |
| Rensselaer | 40.1 |
| Richmond | 40.4 |
| Rockland | 35.6 |
| St Lawrence | 39.0 |
| Saratoga | 43.2 |
| Schenectady | 39.9 |
| Schoharie | 45.7 |
| Schuyler | 47.1 |
| Seneca | 42.5 |
| Steuben | 42.9 |
| Suffolk | 41.7 |
| Sullivan | 42.3 |
| Tioga | 44.5 |
| Tompkins | 31.8 |
| Ulster | 44.2 |
| Warren | 46.8 |
| Washington | 44.4 |
| Wayne | 43.9 |
| Westchester | 41.2 |
| Wyoming | 42.6 |
| Yates | 40.9 |
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- median_age: median age of the population at each
county
sex <- Sex_NY %>%
janitor::clean_names() %>%
filter (x1 == "Male:") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_male") %>%
select (county,percent_male) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_male ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
sex%>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | percent_male |
|---|---|
| Albany | 0.486 |
| Allegany | 0.506 |
| Bronx | 0.474 |
| Broome | 0.495 |
| Cattaraugus | 0.501 |
| Cayuga | 0.516 |
| Chautauqua | 0.497 |
| Chemung | 0.499 |
| Chenango | 0.502 |
| Clinton | 0.517 |
| Columbia | 0.503 |
| Cortland | 0.493 |
| Delaware | 0.507 |
| Dutchess | 0.500 |
| Erie | 0.487 |
| Essex | 0.522 |
| Franklin | 0.546 |
| Fulton | 0.501 |
| Genesee | 0.504 |
| Greene | 0.527 |
| Hamilton | 0.506 |
| Herkimer | 0.499 |
| Jefferson | 0.526 |
| Kings | 0.476 |
| Lewis | 0.504 |
| Livingston | 0.508 |
| Madison | 0.496 |
| Monroe | 0.485 |
| Montgomery | 0.496 |
| Nassau | 0.491 |
| New York | 0.476 |
| Niagara | 0.490 |
| Oneida | 0.502 |
| Onondaga | 0.485 |
| Ontario | 0.492 |
| Orange | 0.504 |
| Orleans | 0.503 |
| Oswego | 0.503 |
| Otsego | 0.488 |
| Putnam | 0.503 |
| Queens | 0.488 |
| Rensselaer | 0.497 |
| Richmond | 0.488 |
| Rockland | 0.494 |
| St Lawrence | 0.513 |
| Saratoga | 0.496 |
| Schenectady | 0.490 |
| Schoharie | 0.500 |
| Schuyler | 0.504 |
| Seneca | 0.527 |
| Steuben | 0.502 |
| Suffolk | 0.495 |
| Sullivan | 0.514 |
| Tioga | 0.499 |
| Tompkins | 0.497 |
| Ulster | 0.499 |
| Warren | 0.495 |
| Washington | 0.522 |
| Wayne | 0.500 |
| Westchester | 0.487 |
| Wyoming | 0.534 |
| Yates | 0.483 |
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_male: percentage of population who are identified
as male
lowbirthweight <- lowbirthweight %>%
janitor::clean_names()%>%
select (region_county, percentage)%>%
rename (county = "region_county", percent_lowbirthweight = "percentage")
lowbirthweight %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | percent_lowbirthweight |
|---|---|
| Nassau | 7.9 |
| Suffolk | 8.0 |
| Bronx | 10.2 |
| Kings | 7.8 |
| New York | 8.0 |
| Queens | 8.7 |
| Richmond | 7.8 |
| Dutchess | 7.3 |
| Orange | 7.0 |
| Putnam | 6.6 |
| Rockland | 5.6 |
| Sullivan | 8.6 |
| Ulster | 7.7 |
| Westchester | 7.8 |
| Albany | 8.2 |
| Columbia | 6.9 |
| Greene | 6.5 |
| Rensselaer | 8.6 |
| Saratoga | 6.5 |
| Schenectady | 9.2 |
| Fulton | 8.8 |
| Herkimer | 6.4 |
| Montgomery | 8.0 |
| Otsego | 5.6 |
| Schoharie | 5.1 |
| Clinton | 8.1 |
| Essex | 7.3 |
| Franklin | 6.7 |
| Hamilton | 0.0 |
| Warren | 6.9 |
| Washington | 8.3 |
| Jefferson | 7.3 |
| Lewis | 6.1 |
| St. Lawrence | 7.2 |
| Cayuga | 7.2 |
| Cortland | 7.5 |
| Madison | 7.4 |
| Oneida | 7.9 |
| Onondaga | 8.2 |
| Oswego | 6.0 |
| Broome | 7.9 |
| Chenango | 7.6 |
| Delaware | 7.2 |
| Tioga | 7.1 |
| Tompkins | 5.7 |
| Chemung | 7.9 |
| Livingston | 7.2 |
| Monroe | 9.1 |
| Ontario | 7.0 |
| Schuyler | 2.8 |
| Seneca | 6.8 |
| Steuben | 7.2 |
| Wayne | 6.8 |
| Yates | 4.0 |
| Allegany | 6.7 |
| Cattaraugus | 7.8 |
| Chautauqua | 8.4 |
| Erie | 8.8 |
| Genesee | 6.8 |
| Niagara | 8.2 |
| Orleans | 6.5 |
| Wyoming | 8.2 |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
This has 87 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_lowbirthweight: percentage of children being born
being identified as low birth weight (<2,500g) (https://www.health.ny.gov/)
The following dataset is obtained from New York State Department of Health (https://www.health.ny.gov/) that contains 4 different health indicators that will complement with the low birthweight. These 5 will act as our outcomes of interest for our regression models.
health <- health %>%
janitor::clean_names()%>%
select (geographic_area,indicator_title,topic_area,rate_percent,measurement) %>%
filter( str_detect (geographic_area, " County")) %>%
mutate (county = str_replace (geographic_area, " County", "")) %>%
select (county, everything()) %>%
select (-geographic_area) %>%
filter(topic_area == "Cancer Indicators" | topic_area == "Respiratory Disease Indicators" | topic_area == "Cardiovascular Disease Indicators" | topic_area == "Maternal and Infant Health Indicators")
cancer <- health %>%
filter (topic_area == "Cancer Indicators") %>%
filter (indicator_title == "All cancer incidence rate per 100,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
mutate(rate_percent = as.numeric(rate_percent)/10) %>%
rename (cancer_mortality_per_10k = "rate_percent")
resp <- health %>%
filter (topic_area == "Respiratory Disease Indicators") %>%
filter (indicator_title == "Asthma hospitalization rate per 10,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (asthma_hosp_rate_per_10k = "rate_percent")
cardio <- health %>%
filter (topic_area == "Cardiovascular Disease Indicators") %>%
filter (indicator_title == "Cardiovascular disease hospitalization rate per 10,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (cardio_hosp_rate_per_10k = "rate_percent")
maternal <- health %>%
filter (topic_area == "Maternal and Infant Health Indicators") %>%
filter (indicator_title == "Percentage of premature births with <37 weeks gestation") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (premature_percentage = "rate_percent")
health_merge <- maternal %>%
inner_join(cancer, by = "county") %>%
inner_join(resp, by = "county") %>%
inner_join(cardio, by = "county")
health_merge %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | premature_percentage | cancer_mortality_per_10k | asthma_hosp_rate_per_10k | cardio_hosp_rate_per_10k |
|---|---|---|---|---|
| Albany | 9.7 | 65.59 | 5.7 | 148.9 |
| Allegany | 8.8 | 66.18 | 3.1 | 151.8 |
| Broome | 9.4 | 69.59 | 5.0 | 171.8 |
| Cattaraugus | 8.4 | 68.99 | 1.9 | 159.0 |
| Cayuga | 8.8 | 72.01 | 2.2 | 169.5 |
| Chautauqua | 9.3 | 73.68 | 2.6 | 90.2 |
| Chemung | 8.4 | 72.31 | 4.2 | 161.9 |
| Chenango | 8.2 | 73.57 | 2.0 | 177.5 |
| Clinton | 8.7 | 68.46 | 2.5 | 147.1 |
| Columbia | 9.0 | 82.23 | 4.7 | 168.6 |
| Cortland | 8.9 | 68.34 | 4.7 | 158.7 |
| Delaware | 8.7 | 75.42 | 2.8 | 181.5 |
| Dutchess | 9.0 | 62.68 | 6.5 | 141.6 |
| Erie | 10.2 | 71.48 | 5.1 | 159.5 |
| Essex | 8.9 | 72.62 | 0.5* | 105.8 |
| Franklin | 8.5 | 67.53 | 1.9 | 117.5 |
| Fulton | 9.0 | 77.05 | 1.7 | 161.5 |
| Genesee | 9.3 | 72.36 | 2.6 | 166.9 |
| Greene | 8.8 | 79.08 | 4.8 | 192.3 |
| Hamilton | s | NA | 0.0* | 175.2 |
| Herkimer | 7.5 | 70.15 | 2.6 | 178.2 |
| Jefferson | 8.8 | 59.69 | 3.8 | 142.9 |
| Lewis | 7.0 | 65.46 | 3.0 | 154.7 |
| Livingston | 7.8 | 68.43 | 2.2 | 127.8 |
| Madison | 9.7 | 62.76 | 1.8 | 122.3 |
| Monroe | 10.0 | 65.74 | 7.4 | 152.4 |
| Montgomery | 9.0 | 76.86 | 3.0 | 200.1 |
| Nassau | 9.2 | 67.06 | 7.1 | 169.5 |
| Niagara | 10.0 | 78.86 | 3.8 | 182.8 |
| Oneida | 9.8 | 69.56 | 3.7 | 176.6 |
| Onondaga | 9.2 | 66.68 | 5.2 | 145.8 |
| Ontario | 9.1 | 76.79 | 2.2 | 147.1 |
| Orange | 8.3 | 53.48 | 6.7 | 141.8 |
| Orleans | 9.6 | 76.96 | 1.6 | 158.1 |
| Oswego | 7.9 | 69.43 | 3.3 | 161.8 |
| Otsego | 7.5 | 68.94 | 2.5 | 178.7 |
| Putnam | 8.7 | 64.47 | 4.6 | 114.9 |
| Rensselaer | 10.2 | 66.05 | 3.5 | 148.5 |
| Rockland | 6.3 | 54.49 | 4.5 | 104.4 |
| St. Lawrence | 8.7 | 62.57 | 3.2 | 157.0 |
| Saratoga | 8.1 | 68.64 | 2.6 | 130.9 |
| Schenectady | 10.3 | 64.28 | 5.0 | 153.4 |
| Schoharie | 8.5 | 69.63 | 2.5 | 142.6 |
| Schuyler | 5.2 | 77.66 | 1.5* | 138.4 |
| Seneca | 8.4 | 70.07 | 1.7 | 142.1 |
| Steuben | 8.0 | 69.41 | 2.7 | 135.1 |
| Suffolk | 10.0 | 68.67 | 5.1 | 173.8 |
| Sullivan | 10.3 | 61.70 | 5.8 | 181.8 |
| Tioga | 8.6 | 76.21 | 2.1 | 84.3 |
| Tompkins | 6.5 | 49.56 | 1.7 | 80.4 |
| Ulster | 9.0 | 66.57 | 5.1 | 162.6 |
| Warren | 10.2 | 80.66 | 4.3 | 173.6 |
| Washington | 11.1 | 75.01 | 3.8 | 161.6 |
| Wayne | 8.8 | 72.46 | 4.3 | 191.0 |
| Westchester | 9.1 | 60.86 | 7.1 | 136.8 |
| Wyoming | 10.3 | 75.61 | 3.5 | 155.1 |
| Yates | 6.0 | 73.05 | 2.0 | 130.7 |
| Bronx | 10.3 | 45.87 | 28.4 | 172.5 |
| Kings | 8.5 | 47.96 | 10.7 | 132.1 |
| New York | 8.7 | 53.95 | 8.4 | 117.3 |
| Queens | 9.1 | 51.60 | 7.8 | 141.7 |
| Richmond | 9.2 | 65.15 | 7.9 | 160.8 |
They are:
- cancer_mortality_per_10k: percentage of cancer mortality
per 100 thousands people in each NY county (Cancer Indicator)
- asthma_hosp_rate_per_10k: percentage of asthma
hospitalization per 10 thousands people in each NY county (Respiratory
Disease Indicator)
- cardio_hosp_rate_per_10k“: percentage of
cardiovascular-disease-related hospitalization per 10 thousands people
in each NY county (Cardiovascular Disease Indicator)
- premature_percentage: percentage of children being born
prematurely (<37 gestational weeks) in each NY county
Here we perform inner_join() to create 1 bigger dataset
called demographic_merge. Then, we join them with our
lowbirthweight & pm2_5 to make a finalized
data frame called merge. And, we will use this for
regression model.
demographic_merge <- age %>%
inner_join(sex, by = "county") %>%
inner_join(income, by = "county") %>%
inner_join(race_merge, by = "county") %>%
inner_join(edu, by = "county") %>%
mutate(county = str_replace (county, "St ", "St. "))
merge <- lowbirthweight %>%
inner_join(pm2_5, by = "county") %>%
inner_join(health_merge, by = "county") %>%
inner_join(demographic_merge, by = "county")
merge <- merge %>%
select (county, annual_pm2.5,everything())%>%
mutate(asthma_hosp_rate_per_10k = as.numeric(asthma_hosp_rate_per_10k))%>%
mutate(cardio_hosp_rate_per_10k = as.numeric(cardio_hosp_rate_per_10k)) %>%
mutate(premature_percentage = as.numeric(premature_percentage))%>%
mutate(cancer_mortality_per_10k = as.numeric(cancer_mortality_per_10k))
merge %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| county | annual_pm2.5 | percent_lowbirthweight | premature_percentage | cancer_mortality_per_10k | asthma_hosp_rate_per_10k | cardio_hosp_rate_per_10k | median_age | percent_male | percent_high_income | percent_non_hisp_white | percent_non_hisp_black | percent_hisp_white | percent_hisp_black | percent_other | percentage_high_education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nassau | 8.7 | 7.9 | 9.2 | 67.06 | 7.1 | 169.5 | 41.8 | 0.491 | 0.712 | 0.583 | 0.110 | 0.063 | 0.006 | 0.239 | 0.475 |
| Suffolk | 6.8 | 8.0 | 10.0 | 68.67 | 5.1 | 173.8 | 41.7 | 0.495 | 0.669 | 0.664 | 0.069 | 0.098 | 0.006 | 0.162 | 0.383 |
| Bronx | 8.3 | 10.2 | 10.3 | 45.87 | 28.4 | 172.5 | 34.8 | 0.474 | 0.296 | 0.090 | 0.285 | 0.109 | 0.060 | 0.456 | 0.209 |
| Kings | 7.9 | 7.8 | 8.5 | 47.96 | 10.7 | 132.1 | 35.7 | 0.476 | 0.466 | 0.367 | 0.286 | 0.054 | 0.018 | 0.274 | 0.395 |
| New York | 10.4 | 8.0 | 8.7 | 53.95 | 8.4 | 117.3 | 38.1 | 0.476 | 0.572 | 0.467 | 0.121 | 0.068 | 0.019 | 0.324 | 0.626 |
| Queens | 8.3 | 8.7 | 9.1 | 51.60 | 7.8 | 141.7 | 39.6 | 0.488 | 0.505 | 0.246 | 0.168 | 0.090 | 0.010 | 0.486 | 0.339 |
| Richmond | 7.8 | 7.8 | 9.2 | 65.15 | 7.9 | 160.8 | 40.4 | 0.488 | 0.567 | 0.596 | 0.091 | 0.097 | 0.010 | 0.206 | 0.349 |
| Dutchess | 7.7 | 7.3 | 9.0 | 62.68 | 6.5 | 141.6 | 42.5 | 0.500 | 0.561 | 0.699 | 0.099 | 0.053 | 0.007 | 0.141 | 0.377 |
| Orange | 6.4 | 7.0 | 8.3 | 53.48 | 6.7 | 141.8 | 37.0 | 0.504 | 0.558 | 0.622 | 0.100 | 0.085 | 0.010 | 0.183 | 0.314 |
| Putnam | 7.9 | 6.6 | 8.7 | 64.47 | 4.6 | 114.9 | 44.2 | 0.503 | 0.672 | 0.764 | 0.029 | 0.068 | 0.006 | 0.133 | 0.415 |
| Rockland | 8.8 | 5.6 | 6.3 | 54.49 | 4.5 | 104.4 | 35.6 | 0.494 | 0.612 | 0.629 | 0.108 | 0.065 | 0.007 | 0.191 | 0.420 |
| Sullivan | 7.2 | 8.6 | 10.3 | 61.70 | 5.8 | 181.8 | 42.3 | 0.514 | 0.418 | 0.702 | 0.080 | 0.066 | 0.005 | 0.147 | 0.270 |
| Ulster | 7.4 | 7.7 | 9.0 | 66.57 | 5.1 | 162.6 | 44.2 | 0.499 | 0.476 | 0.760 | 0.056 | 0.040 | 0.004 | 0.139 | 0.345 |
| Westchester | 9.3 | 7.8 | 9.1 | 60.86 | 7.1 | 136.8 | 41.2 | 0.487 | 0.624 | 0.524 | 0.133 | 0.075 | 0.013 | 0.255 | 0.506 |
| Albany | 7.9 | 8.2 | 9.7 | 65.59 | 5.7 | 148.9 | 38.0 | 0.486 | 0.494 | 0.710 | 0.114 | 0.027 | 0.007 | 0.142 | 0.443 |
| Columbia | 7.0 | 6.9 | 9.0 | 82.23 | 4.7 | 168.6 | 48.9 | 0.503 | 0.489 | 0.848 | 0.039 | 0.022 | 0.002 | 0.089 | 0.349 |
| Greene | 6.8 | 6.5 | 8.8 | 79.08 | 4.8 | 192.3 | 46.6 | 0.527 | 0.419 | 0.838 | 0.053 | 0.035 | 0.004 | 0.070 | 0.256 |
| Rensselaer | 7.5 | 8.6 | 10.2 | 66.05 | 3.5 | 148.5 | 40.1 | 0.497 | 0.514 | 0.820 | 0.061 | 0.020 | 0.004 | 0.094 | 0.343 |
| Saratoga | 7.7 | 6.5 | 8.1 | 68.64 | 2.6 | 130.9 | 43.2 | 0.496 | 0.591 | 0.894 | 0.015 | 0.019 | 0.002 | 0.069 | 0.427 |
| Schenectady | 7.5 | 9.2 | 10.3 | 64.28 | 5.0 | 153.4 | 39.9 | 0.490 | 0.472 | 0.716 | 0.094 | 0.024 | 0.005 | 0.162 | 0.333 |
| Fulton | 6.7 | 8.8 | 9.0 | 77.05 | 1.7 | 161.5 | 44.3 | 0.501 | 0.364 | 0.906 | 0.014 | 0.016 | 0.001 | 0.063 | 0.182 |
| Herkimer | 6.5 | 6.4 | 7.5 | 70.15 | 2.6 | 178.2 | 44.3 | 0.499 | 0.410 | 0.932 | 0.012 | 0.012 | 0.000 | 0.045 | 0.214 |
| Montgomery | 6.9 | 8.0 | 9.0 | 76.86 | 3.0 | 200.1 | 40.5 | 0.496 | 0.345 | 0.794 | 0.026 | 0.056 | 0.004 | 0.120 | 0.192 |
| Otsego | 6.4 | 5.6 | 7.5 | 68.94 | 2.5 | 178.7 | 42.2 | 0.488 | 0.400 | 0.904 | 0.018 | 0.018 | 0.002 | 0.058 | 0.333 |
| Schoharie | 6.5 | 5.1 | 8.5 | 69.63 | 2.5 | 142.6 | 45.7 | 0.500 | 0.423 | 0.910 | 0.013 | 0.015 | 0.003 | 0.059 | 0.232 |
| Clinton | 7.0 | 8.1 | 8.7 | 68.46 | 2.5 | 147.1 | 40.7 | 0.517 | 0.408 | 0.897 | 0.039 | 0.007 | 0.001 | 0.056 | 0.254 |
| Essex | 3.4 | 7.3 | 8.9 | 72.62 | NA | 105.8 | 48.3 | 0.522 | 0.408 | 0.912 | 0.029 | 0.010 | 0.003 | 0.046 | 0.305 |
| Franklin | 7.4 | 6.7 | 8.5 | 67.53 | 1.9 | 117.5 | 40.6 | 0.546 | 0.358 | 0.811 | 0.051 | 0.008 | 0.001 | 0.129 | 0.216 |
| Hamilton | 6.0 | 0.0 | NA | NA | NA | 175.2 | 56.4 | 0.506 | 0.432 | 0.921 | 0.010 | 0.011 | 0.000 | 0.058 | 0.205 |
| Warren | 7.2 | 6.9 | 10.2 | 80.66 | 4.3 | 173.6 | 46.8 | 0.495 | 0.466 | 0.930 | 0.012 | 0.014 | 0.002 | 0.043 | 0.331 |
| Washington | 7.3 | 8.3 | 11.1 | 75.01 | 3.8 | 161.6 | 44.4 | 0.522 | 0.416 | 0.911 | 0.027 | 0.011 | 0.003 | 0.048 | 0.206 |
| Jefferson | 6.8 | 7.3 | 8.8 | 59.69 | 3.8 | 142.9 | 32.9 | 0.526 | 0.377 | 0.809 | 0.055 | 0.039 | 0.004 | 0.093 | 0.243 |
| Lewis | 6.1 | 6.1 | 7.0 | 65.46 | 3.0 | 154.7 | 42.0 | 0.504 | 0.390 | 0.946 | 0.009 | 0.005 | 0.007 | 0.032 | 0.188 |
| St. Lawrence | 7.7 | 7.2 | 8.7 | 62.57 | 3.2 | 157.0 | 39.0 | 0.513 | 0.356 | 0.917 | 0.025 | 0.006 | 0.001 | 0.050 | 0.236 |
| Cayuga | 6.4 | 7.2 | 8.8 | 72.01 | 2.2 | 169.5 | 43.3 | 0.516 | 0.397 | 0.888 | 0.039 | 0.013 | 0.001 | 0.059 | 0.222 |
| Cortland | 6.1 | 7.5 | 8.9 | 68.34 | 4.7 | 158.7 | 36.5 | 0.493 | 0.404 | 0.914 | 0.019 | 0.016 | 0.000 | 0.051 | 0.281 |
| Madison | 6.1 | 7.4 | 9.7 | 62.76 | 1.8 | 122.3 | 42.5 | 0.496 | 0.417 | 0.917 | 0.015 | 0.011 | 0.000 | 0.056 | 0.262 |
| Oneida | 6.5 | 7.9 | 9.8 | 69.56 | 3.7 | 176.6 | 40.8 | 0.502 | 0.409 | 0.811 | 0.057 | 0.027 | 0.003 | 0.101 | 0.262 |
| Onondaga | 5.1 | 8.2 | 9.2 | 66.68 | 5.2 | 145.8 | 39.3 | 0.485 | 0.443 | 0.760 | 0.103 | 0.020 | 0.006 | 0.110 | 0.365 |
| Oswego | 6.4 | 6.0 | 7.9 | 69.43 | 3.3 | 161.8 | 40.7 | 0.503 | 0.416 | 0.928 | 0.010 | 0.012 | 0.001 | 0.050 | 0.208 |
| Broome | 6.9 | 7.9 | 9.4 | 69.59 | 5.0 | 171.8 | 39.7 | 0.495 | 0.362 | 0.824 | 0.050 | 0.018 | 0.003 | 0.105 | 0.289 |
| Chenango | 6.5 | 7.6 | 8.2 | 73.57 | 2.0 | 177.5 | 44.7 | 0.502 | 0.364 | 0.941 | 0.006 | 0.014 | 0.001 | 0.038 | 0.183 |
| Delaware | 6.6 | 7.2 | 8.7 | 75.42 | 2.8 | 181.5 | 48.5 | 0.507 | 0.334 | 0.912 | 0.016 | 0.022 | 0.002 | 0.048 | 0.230 |
| Tioga | 6.6 | 7.1 | 8.6 | 76.21 | 2.1 | 84.3 | 44.5 | 0.499 | 0.428 | 0.943 | 0.009 | 0.010 | 0.000 | 0.037 | 0.258 |
| Tompkins | 6.2 | 5.7 | 6.5 | 49.56 | 1.7 | 80.4 | 31.8 | 0.497 | 0.427 | 0.769 | 0.035 | 0.023 | 0.005 | 0.168 | 0.553 |
| Chemung | 6.2 | 7.9 | 8.4 | 72.31 | 4.2 | 161.9 | 41.4 | 0.499 | 0.377 | 0.853 | 0.058 | 0.016 | 0.002 | 0.072 | 0.239 |
| Livingston | 6.7 | 7.2 | 7.8 | 68.43 | 2.2 | 127.8 | 41.7 | 0.508 | 0.428 | 0.895 | 0.026 | 0.017 | 0.002 | 0.060 | 0.289 |
| Monroe | 7.4 | 9.1 | 10.0 | 65.74 | 7.4 | 152.4 | 39.2 | 0.485 | 0.444 | 0.698 | 0.140 | 0.041 | 0.009 | 0.113 | 0.400 |
| Ontario | 6.6 | 7.0 | 9.1 | 76.79 | 2.2 | 147.1 | 43.6 | 0.492 | 0.474 | 0.891 | 0.020 | 0.022 | 0.003 | 0.064 | 0.375 |
| Schuyler | 6.0 | 2.8 | 5.2 | 77.66 | NA | 138.4 | 47.1 | 0.504 | 0.411 | 0.944 | 0.009 | 0.009 | 0.001 | 0.038 | 0.243 |
| Seneca | 6.4 | 6.8 | 8.4 | 70.07 | 1.7 | 142.1 | 42.5 | 0.527 | 0.373 | 0.879 | 0.050 | 0.018 | 0.002 | 0.051 | 0.215 |
| Steuben | 4.8 | 7.2 | 8.0 | 69.41 | 2.7 | 135.1 | 42.9 | 0.502 | 0.384 | 0.929 | 0.016 | 0.009 | 0.001 | 0.046 | 0.255 |
| Wayne | 6.7 | 6.8 | 8.8 | 72.46 | 4.3 | 191.0 | 43.9 | 0.500 | 0.424 | 0.889 | 0.021 | 0.019 | 0.001 | 0.071 | 0.244 |
| Yates | 6.0 | 4.0 | 6.0 | 73.05 | 2.0 | 130.7 | 40.9 | 0.483 | 0.372 | 0.948 | 0.008 | 0.015 | 0.000 | 0.029 | 0.245 |
| Allegany | 6.5 | 6.7 | 8.8 | 66.18 | 3.1 | 151.8 | 39.2 | 0.506 | 0.342 | 0.941 | 0.018 | 0.009 | 0.001 | 0.031 | 0.224 |
| Cattaraugus | 7.3 | 7.8 | 8.4 | 68.99 | 1.9 | 159.0 | 42.3 | 0.501 | 0.335 | 0.897 | 0.013 | 0.013 | 0.001 | 0.076 | 0.200 |
| Chautauqua | 6.6 | 8.4 | 9.3 | 73.68 | 2.6 | 90.2 | 42.8 | 0.497 | 0.323 | 0.865 | 0.023 | 0.033 | 0.005 | 0.074 | 0.233 |
| Erie | 7.6 | 8.8 | 10.2 | 71.48 | 5.1 | 159.5 | 40.2 | 0.487 | 0.419 | 0.746 | 0.124 | 0.020 | 0.004 | 0.107 | 0.359 |
| Genesee | 7.3 | 6.8 | 9.3 | 72.36 | 2.6 | 166.9 | 42.6 | 0.504 | 0.419 | 0.900 | 0.020 | 0.012 | 0.001 | 0.066 | 0.220 |
| Niagara | 8.7 | 8.2 | 10.0 | 78.86 | 3.8 | 182.8 | 43.3 | 0.490 | 0.404 | 0.845 | 0.066 | 0.017 | 0.002 | 0.071 | 0.263 |
| Orleans | 7.3 | 6.5 | 9.6 | 76.96 | 1.6 | 158.1 | 43.2 | 0.503 | 0.361 | 0.857 | 0.056 | 0.021 | 0.001 | 0.065 | 0.172 |
| Wyoming | 7.0 | 8.2 | 10.3 | 75.61 | 3.5 | 155.1 | 42.6 | 0.534 | 0.391 | 0.888 | 0.040 | 0.015 | 0.002 | 0.056 | 0.181 |
This merge dataset is comprised of 62 rows and 16
columns of data, with first column being the name of the
county, and other 15 beings all continuous variables of
identified indicators of the whole 62 counties of NY state.
uscounties <- uscounties %>%
filter (state_id == "NY") %>%
select (county, lat, lng)
map <- merge %>%
inner_join(uscounties, by ="county")
write.csv(map, "NY_map.csv", row.names = FALSE)
This file is specifically made for the purpose of making map (in Shiny App). We obtained latitude and longitude data of all NY counties from SimpleMaps Database (https://simplemaps.com/data/us-counties).
Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.
First, we want to understand all the chosen variables we have. Since
they are all continuous, we want to look at how many data points are
being missing, their mean, median, and standard deviation. The code
chunk below summarizes all of that statistics from our
merge dataset.
merge_cont =
merge %>%
select(-county)
skim_table <- skim(merge_cont)
numeric_summary <- skim_table %>%
select(Variables = skim_variable, Nmiss = n_missing, Mean = numeric.mean, Median = numeric.p50, STD = numeric.sd)
kable(numeric_summary, "html", digits = 1) %>%
kable_styling(full_width = FALSE, position = "center", font_size = 13) %>%
column_spec(1, width = "3cm") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "2cm")
| Variables | Nmiss | Mean | Median | STD |
|---|---|---|---|---|
| annual_pm2.5 | 0 | 7.0 | 6.8 | 1.1 |
| percent_lowbirthweight | 0 | 7.2 | 7.3 | 1.5 |
| premature_percentage | 1 | 8.8 | 8.9 | 1.1 |
| cancer_mortality_per_10k | 1 | 68.0 | 68.9 | 8.0 |
| asthma_hosp_rate_per_10k | 3 | 4.4 | 3.7 | 3.8 |
| cardio_hosp_rate_per_10k | 0 | 151.1 | 154.9 | 25.9 |
| median_age | 0 | 41.9 | 42.1 | 3.9 |
| percent_male | 0 | 0.5 | 0.5 | 0.0 |
| percent_high_income | 0 | 0.4 | 0.4 | 0.1 |
| percent_non_hisp_white | 0 | 0.8 | 0.9 | 0.2 |
| percent_non_hisp_black | 0 | 0.1 | 0.0 | 0.1 |
| percent_hisp_white | 0 | 0.0 | 0.0 | 0.0 |
| percent_hisp_black | 0 | 0.0 | 0.0 | 0.0 |
| percent_other | 0 | 0.1 | 0.1 | 0.1 |
| percentage_high_education | 0 | 0.3 | 0.3 | 0.1 |
The table provides descriptive statistics for various variables in
the dataset. Key insights include the average annual_pm2.5
level being 7.0, with a small standard deviation of 1.1, indicating
relatively consistent air quality. Additionally, variables such as
premature_percentage and
cancer_mortality_per_10k show slight variability with a
small number of missing values, while hospitalization rates and
demographic characteristics exhibit diverse distributions. Overall, the
summary highlights the central tendency, variability, and distribution
characteristics of the dataset’s variables.
Hover over the interactive graph or zoom in for more detailed information.
Then, we aimed to look that the distribution of
percent_high_income across all counties. We performed a
general barplot that allowed us to understand the overall
distribution.
merge %>%
plot_ly(
x = ~reorder(county, percent_high_income),
y = ~percent_high_income,
type = "bar",
marker = list(color = "red1")
) %>%
layout(
title = "Percentage of High Income",
xaxis = list(title = "County Name", categoryorder = "total descending"),
yaxis = list(title = "Percentage"),
barmode = "stack"
)
The bar chart depicts the “Percentage of High Income” across 62 counties of New York State, ordered from the highest to the lowest percentage. Nassau county has the highest percentage of household with an annual income of more than $75,000 (categorized as high income), while that of the Bronx county is the lowest.
Hover over the interactive graph or zoom in for more detailed information.
Due to the nature of variety of racial group indicators of each county, we wanted to make a stacked bargraph to see the different between each county’s racial composition.
race_plot <- merge %>%
select (county, percent_non_hisp_white, percent_non_hisp_black, percent_hisp_white, percent_hisp_black, percent_other) %>%
pivot_longer(
cols = starts_with("percent_"), names_to = "race", values_to = "percentage")
race_plot%>%
plot_ly(x = ~county, y = ~percentage, type = "bar",color = ~race,colors = "RdYlGn", hoverinfo = "y+name") %>%
layout(barmode = "stack",
title = "Racial Composition in NY county",
xaxis = list(title = "County"),
yaxis = list(title = "Percentage (%)"))
This stacked bar chart displays the racial demographics across 62 New York counties. Each bar represents a county and is divided into segments that reflect the percentage of the county’s population of the following racial categories: non-Hispanic White, non-Hispanic Black, Hispanic White, Hispanic Black, and Other. The majority of New York State population is non-Hispanic-White, which is also the major racial group in almost all counties, except in Bronx county and Queens county.
The prevalence of our 5 outcomes of interest across 62 counties are illustrated in these 5 graphs below. For each outcome, we have highlighted the TOP 5 to be circled with red color, while the bottom 5 will be circled with blue color, for easier understanding as there are many data points for NY State.
Hover over the interactive graph or zoom in for more detailed information.
# Create a scatter plot
LBR_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~percent_lowbirthweight,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Low Birthweight",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Percentage of Low Birthweight"),
showlegend = FALSE
)
# Identify the top and bottom points
top_points <- merge %>% arrange(desc(percent_lowbirthweight)) %>% slice_head(n = 5)
bottom_points <- merge %>% arrange(percent_lowbirthweight) %>% slice_head(n = 5)
# Identify the other points
suppressMessages({other_points <- merge %>%
anti_join(top_points) %>%
anti_join(bottom_points)})
# Add markers for the top and bottom points
LBR_graph <- LBR_graph %>%
add_markers(
data = top_points,
x = ~county,
y = ~percent_lowbirthweight,
color = I("red"), # Customize color for top points
size = I(10), # Customize marker size for top points
name = "Top 5"
) %>%
add_markers(
data = bottom_points,
x = ~county,
y = ~percent_lowbirthweight,
color = I("blue"), # Customize color for bottom points
size = I(10), # Customize marker size for bottom points
name = "Bottom 5") %>%
add_markers(
data = other_points,
x = ~county,
y = ~percent_lowbirthweight,
color = I("grey"), # Customize color for other points
size = I(5), # Customize marker size for other points
name = "Other Points"
)
# Display the plot
LBR_graph
TOP 5: (Bronx, 10.2), (Schenectady, 9.2), (Monroe, 9.1), (Fulton, 8.8), (Erie, 8.8) .
BOTTOM 5: (Hamilton,0), (Schuyler,2.8), (Yates, 4), (Schoharie, 5.1), (Rockland, 5.6).
Hover over the interactive graph or zoom in for more detailed information.
# Create a scatter plot
premature_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~premature_percentage,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Premature Birth",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Percentage of Premature Birth"),
showlegend = FALSE
)
# Identify the top and bottom points
top_points1 <- merge %>% arrange(desc(premature_percentage)) %>% slice_head(n = 5)
bottom_points1 <- merge %>% arrange(premature_percentage) %>% slice_head(n = 5)
# Identify the other points
suppressMessages({other_points1 <- merge %>%
anti_join(top_points1) %>%
anti_join(bottom_points1)})
# Add markers for the top and bottom points
premature_graph <- premature_graph %>%
add_markers(
data = top_points1,
x = ~county,
y = ~premature_percentage,
color = I("red"), # Customize color for top points
size = I(10), # Customize marker size for top points
name = "Top 5"
) %>%
add_markers(
data = bottom_points1,
x = ~county,
y = ~premature_percentage,
color = I("blue"), # Customize color for bottom points
size = I(10), # Customize marker size for bottom points
name = "Bottom 5") %>%
add_markers(
data = other_points1,
x = ~county,
y = ~premature_percentage,
color = I("grey"), # Customize color for other points
size = I(5), # Customize marker size for other points
name = "Other Points"
)
# Display the plot
premature_graph
TOP 5: (Washington, 11.1), (Schenectady, 10.3), (Bronx, 10.3), (Sullivan, 10.3), (Wyoming, 10.3) .
BOTTOM 5: (Schuyler,5.2), (Yates , 6), (Rockland, 6.3), (Tompskin, 6.5), (Lewis, 7).
Hover over the interactive graph or zoom in for more detailed information.
# Create a scatter plot
asthma_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~asthma_hosp_rate_per_10k,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Asthma Hospitalization",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 10k"),
showlegend = FALSE
)
# Identify the top and bottom points
top_points2 <- merge %>% arrange(desc(asthma_hosp_rate_per_10k)) %>% slice_head(n = 5)
bottom_points2 <- merge %>% arrange(asthma_hosp_rate_per_10k) %>% slice_head(n = 5)
# Identify the other points
suppressMessages({other_points2 <- merge %>%
anti_join(top_points2) %>%
anti_join(bottom_points2)})
# Add markers for the top and bottom points
asthma_graph <- asthma_graph %>%
add_markers(
data = top_points2,
x = ~county,
y = ~asthma_hosp_rate_per_10k,
color = I("red"), # Customize color for top points
size = I(10), # Customize marker size for top points
name = "Top 5"
) %>%
add_markers(
data = bottom_points2,
x = ~county,
y = ~asthma_hosp_rate_per_10k,
color = I("blue"), # Customize color for bottom points
size = I(10), # Customize marker size for bottom points
name = "Bottom 5") %>%
add_markers(
data = other_points2,
x = ~county,
y = ~asthma_hosp_rate_per_10k,
color = I("grey"), # Customize color for other points
size = I(5), # Customize marker size for other points
name = "Other Points"
)
# Display the plot
asthma_graph
TOP 5: (Bronx, 28.4), (Kings, 10.7), (New York, 8.4), (Richmond, 7.9), (Queens, 7.8) .
BOTTOM 5: (Orleans,1.6), (Seneca , 1.7), (Tompskin, 1.7), (Fulton, 1.7), (Madison, 1.8).
Hover over the interactive graph or zoom in for more detailed information.
# Create a scatter plot
cancer_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~cancer_mortality_per_10k,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Cancer Mortality",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 10k"),
showlegend = FALSE
)
# Identify the top and bottom points
top_points3 <- merge %>% arrange(desc(cancer_mortality_per_10k)) %>% slice_head(n = 5)
bottom_points3 <- merge %>% arrange(cancer_mortality_per_10k) %>% slice_head(n = 5)
# Identify the other points
suppressMessages({other_points3 <- merge %>%
anti_join(top_points3) %>%
anti_join(bottom_points3)})
# Add markers for the top and bottom points
cancer_graph <- cancer_graph %>%
add_markers(
data = top_points3,
x = ~county,
y = ~cancer_mortality_per_10k,
color = I("red"), # Customize color for top points
size = I(10), # Customize marker size for top points
name = "Top 5"
) %>%
add_markers(
data = bottom_points3,
x = ~county,
y = ~cancer_mortality_per_10k,
color = I("blue"), # Customize color for bottom points
size = I(10), # Customize marker size for bottom points
name = "Bottom 5") %>%
add_markers(
data = other_points3,
x = ~county,
y = ~cancer_mortality_per_10k,
color = I("grey"), # Customize color for other points
size = I(5), # Customize marker size for other points
name = "Other Points"
)
# Display the plot
cancer_graph
TOP 5: (Columbia, 82.23), (Warren, 80.66), (Greene, 79.08), (Niagara, 78.86), (Schuyler, 77.66) .
BOTTOM 5: (Bronx ,45.87), (Kings , 47.96), (Tompskin, 49.56), (Queens, 51.6), (Orange, 53.48) .
Hover over the interactive graph or zoom in for more detailed information.
# Create a scatter plot
cardio_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~cardio_hosp_rate_per_10k,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Cardiovascular Disease Hospitalization",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 10k"),
showlegend = FALSE
)
# Identify the top and bottom points
top_points4 <- merge %>% arrange(desc(cardio_hosp_rate_per_10k)) %>% slice_head(n = 5)
bottom_points4 <- merge %>% arrange(cardio_hosp_rate_per_10k) %>% slice_head(n = 5)
# Identify the other points
suppressMessages({other_points4 <- merge %>%
anti_join(top_points4) %>%
anti_join(bottom_points4)})
# Add markers for the top and bottom points
cardio_graph <- cardio_graph %>%
add_markers(
data = top_points4,
x = ~county,
y = ~cardio_hosp_rate_per_10k,
color = I("red"), # Customize color for top points
size = I(10), # Customize marker size for top points
name = "Top 5"
) %>%
add_markers(
data = bottom_points4,
x = ~county,
y = ~cardio_hosp_rate_per_10k,
color = I("blue"), # Customize color for bottom points
size = I(10), # Customize marker size for bottom points
name = "Bottom 5") %>%
add_markers(
data = other_points4,
x = ~county,
y = ~cardio_hosp_rate_per_10k,
color = I("grey"), # Customize color for other points
size = I(5), # Customize marker size for other points
name = "Other Points"
)
# Display the plot
cardio_graph
TOP 5: (Montgomery, 200.1), (Greene, 192.3), (Wayne, 191), (Niagara, 182.8), (Sullivan, 181.8) .
BOTTOM 5: (Tompskin ,80.4), (Tioga , 84.3), (Chautauqua, 90.2), (Rockland, 104.4), (Essex, 105.8) .
If you undertake formal statistical analyses, describe these in detail
After cleaning the data, we then look at the effect that the annual concentration of pm 2.5 separately has on multiple outcomes among 62 counties in New York, including low birth weight rate, premature birth rate, cancer mortality rate, asthma-related hospitalization rate, and cardiovascular-disease-related hospitalization rate.
Since all of the outcomes are measured as rates (i.e., continuous), we use linear regression (LR) models to assess the effect of annual concentration of pm 2.5 on the outcomes of interests, adjusting for age, sex, ethnicity, education, income. In our models:
Age is reported as a continuous variable that reflects the median age of each county.
Sex is reported as a continuous variable that reflects the percentage of male in each county. The variable that reflects the percentage of female in each county is not included in our model to avoid perfect multicollinearity.
Ethnicity is reported as a continuous variable that reflects the percentage of different ethnicity groups in each county, including Hispanic-Black, Hispanic-White, non-Hispanic-Black, non-Hispanic-White, and Others (if Asian, Native American, or belong to two or more race). The variable that reflects the percentage of people who belong to the category Others is not included in our model to avoid perfect multicollinearity.
Education is reported as a continuous variable that reflects the percentage of people who have obtained one or more higher education degrees in each county.
Income is reported as a continuous variable that reflects the percentage of household of each county that have annual income exceeding $75,000.
For each of the five outcomes of interest, we first start with the full model:
Outcome = Annual pm 2.5 concentration + median age + percentage of male + percentage of Hispanic-Black + percentage of Hispanic-White + percentage of non-Hispanic-Black + percentage of non-Hispanic-White + percentage of household income exceeding 75,000 USD + percentage of people obtained higher education
We then perform bidirectional stepwise selection on all five models and output the best models. The full model and the resulting best models after stepwise selection for each outcome are detailed below:
Full model
lm_pm2.5_birthweight_adjusted <-lm(percent_lowbirthweight ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_birthweight_adjusted_best <- step(lm_pm2.5_birthweight_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_birthweight_adjusted_best)%>%
tab_model()
| percent lowbirthweight | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 11.53 | 7.18 – 15.88 | <0.001 |
| median age | -0.11 | -0.21 – -0.02 | 0.024 |
| percent non hisp black | 8.05 | 1.34 – 14.76 | 0.019 |
| Observations | 62 | ||
| R2 / R2 adjusted | 0.274 / 0.249 | ||
The best model that was outputted by the stepwise regression is
low birth weight = 11.53 - 0.11(median age) + 8.05 (percent_non_hisp_black)
On average, for a county whose median age is 0 and has 0% of non-Hispanic-Black, the expected low birth weight rate is 11.5% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.05%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.
The adjusted R-squared of 0.249 implies that 24.9% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).
Full model
lm_pm2.5_premature_adjusted <-lm(premature_percentage ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_premature_adjusted_best <- step(lm_pm2.5_premature_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_premature_adjusted_best)%>%
tab_model()
| premature percentage | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 4.88 | 0.96 – 8.80 | 0.016 |
| median age | 0.08 | -0.01 – 0.17 | 0.067 |
| percent non hisp black | 8.03 | 2.63 – 13.42 | 0.004 |
| Observations | 61 | ||
| R2 / R2 adjusted | 0.135 / 0.105 | ||
The best model that was outputted by the stepwise regression is
premature birth weight = 4.88 + 0.08(median age) + 8.03 (percent_non_hisp_black)
On average, for a county whose median age is 0 and has 0% of non-Hispanic-Black, the expected low birth weight rate is 4.88% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.03%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.
The adjusted R-squared of 0.105 implies that 10.5% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).
Detailed below are the the initial model, the final model and the interpretation for each outcome.
Full model
lm_pm2.5_cancer_adjusted <-lm(cancer_mortality_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_cancer_adjusted_best<-step(lm_pm2.5_cancer_adjusted, direction = 'both', trace=FALSE)
summary (lm_pm2.5_cancer_adjusted_best)%>%
tab_model()
| cancer mortality per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -0.18 | -51.73 – 51.36 | 0.994 |
| annual pm2 5 | 1.33 | 0.10 – 2.55 | 0.035 |
| median age | 1.31 | 0.98 – 1.65 | <0.001 |
| percent high income | -21.35 | -37.32 – -5.38 | 0.010 |
| percent non hisp white | 46.40 | 20.09 – 72.71 | 0.001 |
| percent non hisp black | 53.63 | 2.27 – 104.99 | 0.041 |
| percent hisp white | 102.38 | 10.35 – 194.41 | 0.030 |
| percent hisp black | -178.74 | -417.28 – 59.79 | 0.139 |
| percent male | -57.41 | -140.92 – 26.09 | 0.174 |
| Observations | 61 | ||
| R2 / R2 adjusted | 0.809 / 0.780 | ||
The best model that was outputted by the stepwise regression is
cancer mortality rate = -1.85 + 13.27(annual pm 2.5) + 13.14(median age) - 213.49(percent high income) + 464.03(percent non hisp white) + 536.28(percent non hisp black) + 1023.82(percent hisp white) - 1787.43(percent hisp black) - 574.15(percent male)
On average, for a county whose annual pm 2.5 concentration is 0, median age is 0, has 0% of high income, has 0% of non-Hispanic-Black, has 0% of non-Hispanic-White, has 0% of Hispanic-Black, has 0% of Hispanic-White, has 0% of male, the expected low birth weight rate is -1.85 case per 100,000 people (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of annual pm 2.5 concentration, the expected cancer mortality rate of the county increases by 13.27 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for annual pm 2.5 concentration is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected cancer mortality rate of the county decreases by 13.14 cases per 100,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of high income, the expected cancer mortality rate of the county decreases by 213.49 cases per 100,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of high income is significantly different than 0.
On average, for every unit increase of percentage of non-Hispanic-White, the expected cancer mortality rate of the county increases by 464.03 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-White is significantly different than 0.
On average, for every unit increase of percentage of non-Hispanic-Black, the expected cancer mortality rate of the county increases by 536.28 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage Hispanic-White, the expected cancer mortality rate of the county increases by 1023.08 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of Hispanic-White is significantly different than 0.
On average, for every unit increase of percentage Hispanic-Black, the expected cancer mortality rate of the county decreases by 1787.43 deaths per 100,00 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient percentage of Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage of male, the expected cancer mortality rate of the county decreases by 574.15 deaths per 100,00 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for the percentage of male is significantly different than 0.
The adjusted R-squared of 0.780 implies that 70.8% of the variation in the response variable can be explained by its linear relationship with the set of the 8 predictors (annual pm 2.5 conc., median age, percentage of high income, percentage of 4 ethnicity groups, percentage of male).
Detailed below are the the initial model, the final model and the interpretation for each outcome.
Full model
lm_pm2.5_asthma_adjusted <-lm(asthma_hosp_rate_per_10k ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_asthma_adjusted_best <- step(lm_pm2.5_asthma_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_asthma_adjusted_best)%>%
tab_model()
| asthma hosp rate per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.36 | 0.06 – 2.67 | 0.041 |
| percent high income | 4.65 | 0.32 – 8.99 | 0.036 |
| percent non hisp black | 13.52 | 5.77 – 21.26 | 0.001 |
| percent hisp black | 366.67 | 314.84 – 418.50 | <0.001 |
| percentage high education | -5.31 | -9.55 – -1.06 | 0.015 |
| Observations | 59 | ||
| R2 / R2 adjusted | 0.935 / 0.930 | ||
The best model that was outputted by the stepwise regression is
asthma hospitalization rate = 1.36 + 4.65(percent high income) + 13.52(percent non hisp black) + 366.67 (percent hisp black) - 5.31(percent high education)
On average, for a county who has zero percent of high income, has zero percent of non-Hispanic-Black, has zero percent of Hispanic-Black, has zero percent of high education, the expected asthma hospitalization rate is 1.85 case per 10,000 people (no practical meaning). Since the p-value is smaller han 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase in the percentage of high income, the expected asthma hospitalization rate of the county increases by 4.65 cases per 10,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of high incomeis significantly different than 0.
On average, for every unit increase of percentage of non-Hispanic-Black, the expected cancer mortality rate of the county increases by 13.52 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of non-Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage of Hispanic-Black, the expected cancer mortality rate of the county increases by 366.67 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage of high education, the expected cancer mortality rate of the county decreases by 5.31 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of higher education is significantly different than 0.
The adjusted R-squared of 0.930 implies that 93.0% of the variation in the response variable can be explained by its linear relationship with the set of the 4 predictors (percentage of high income, percentage of non-Hispanic-Black, percentage of Hispanic-Black, percentage of higher education).
Detailed below are the the initial model, the final model and the interpretation for each outcome.
Full model
lm_pm2.5_cardio_adjusted <-lm(cardio_hosp_rate_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_cardio_adjusted_best<-step(lm_pm2.5_cardio_adjusted, direction = 'both', trace=FALSE)
summary (lm_pm2.5_cardio_adjusted_best)%>%
tab_model()
| cardio hosp rate per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 157.52 | -155.87 – 470.91 | 0.318 |
| annual pm2 5 | 6.97 | -0.14 – 14.08 | 0.055 |
| median age | 1.73 | 0.01 – 3.44 | 0.049 |
| percent non hisp white | 117.26 | -30.46 – 264.97 | 0.117 |
| percent non hisp black | 222.99 | -83.73 – 529.70 | 0.151 |
| percent hisp white | 439.56 | -49.52 – 928.63 | 0.077 |
| percentage high education | -148.61 | -224.74 – -72.49 | <0.001 |
| percent male | -405.53 | -920.48 – 109.43 | 0.120 |
| Observations | 62 | ||
| R2 / R2 adjusted | 0.329 / 0.242 | ||
The best model that was outputted by the stepwise regression is
cardiovascular disease hospitalization rate = 157.52 + 6.97(annual pm 2.5) + 1.73(median age) + 117.26 (percent non hisp white) + 222.99(percent non hisp black) + 439.56(percent hisp white) - 148.61.43(percent high education) - 405.53(percent male)
On average, for a county whose annual pm 2.5 concentration is 0, median age is 0, has zero percent of non-Hispanic-White, has zero percent of non-Hispanic-Black, has zero percent of Hispanic-White, has zero percent of high education, has zero percent of male, the expected cardiovascular disease hospitalization rate is 157.52 case per 10,000 people (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of annual pm 2.5 concentration, the expected cardiovascular disease hospitalization rate of the county increases by 6.97cases per 10,000 people. However, since the p-value is smaller than 0.05, we do not have sufficient evidence to claim that the beta coefficient for annual pm 2.5 concentration is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected cardiovascular disease hospitalization rate of the county decreases by 1.73 cases per 10,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase of percent of non-Hispanic-White, the expected cardiovascular disease hospitalization rate of the county increases by 177.26 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for the percent of non-Hispanic-White is significantly different than 0.
On average, for every unit increase of percent of non-Hispanic-Black, the expected cardiovascular disease hospitalization rate of the county increases by 222.99 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percent of non-Hispanic-Black is significantly different than 0.
On average, for every unit increase of percent of Hispanic-White, the expected cardiovascular disease hospitalization rate of the county increases by 439.56 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percent of Hispanic-White is significantly different than 0.
On average, for every unit increase of percent of higher education, the expected cardiovascular disease hospitalization rate of the county decreases by 148.61 cases per 10,000 people. Since the p-value is larger than 0.05, we have sufficient evidence to claim that the beta coefficient for percent of higher education is significantly different than 0.
On average, for every unit increase of percentage of male, the expected cardiovascular disease hospitalization rate of the county decreases by 405.53 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percentage male is significantly different than 0.
The adjusted R-squared of 0.242 implies that 24.2% of the variation in the response variable can be explained by its linear relationship with the set of the 7 predictors (annual pm 2.5 concentration, median age, percent non-Hispanic-White, percent non-Hispanic-Black, percent Hispanic-White, percentage of higher education, percentage of male).
What were your findings? Are they what you expect? What insights into the data can you make?